Counting common substrings effectively
Abstract
This article presents an efficient (dynamic programming) algorithm for counting the substrings of a given string that are also substrings of a second string. The algorithm can be used, for example, to quickly compute a string similarity measure based on the generalized n-gram method (the Niewiadomski measure [2]), which is also shown. Correctness and complexity analyses are included.
1 Notation
If w and s denote words, then:
• |w| is the length of the word w (the number of letters),
• the letters of w are indexed from 0 to |w| − 1,
• w[a..b] (for a, b ∈ N) is the subword of w from the a-th to the b-th letter (inclusive), or the empty word when a > b,
• w[a..b) (for a, b ∈ N) is the subword w[a..b − 1] of w,
• w[a..) (for a ∈ N) is the subword w[a..|w|) (the suffix of w starting at the a-th letter),
• w[a] (for a ∈ N) is the a-th letter of w (w[a] = w[a..a]),
• ws is the concatenation of the words w and s, i.e. the word of length |w| + |s| such that (ws)[0..|w|) = w and (ws)[|w|..) = s,
• w = w[0..|w| − 1] = w[0..|w|) = w[0..) = w[0]w[1] ... w[|w| − 1].
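The counting problem stated in the abstract can be illustrated with a short Python sketch. It is not taken from the article and is not claimed to be its algorithm. It assumes that substrings are counted by occurrence, i.e. as pairs (a, b) with w[a..b) non-empty and present somewhere in s, and it uses a standard longest-common-prefix table filled by dynamic programming. The function name `count_common_substrings` is hypothetical. The half-open subword w[a..b) corresponds to the Python slice `w[a:b]`.

```python
def count_common_substrings(w: str, s: str) -> int:
    """Count pairs (a, b), 0 <= a < b <= len(w), such that w[a:b] occurs in s.

    Assumption (not from the article): substrings are counted by occurrence
    in w, so equal substrings at different positions are counted separately.
    """
    n, m = len(w), len(s)
    # prev[j] = length of the longest common prefix of w[i+1:] and s[j:]
    prev = [0] * (m + 1)
    total = 0
    for i in range(n - 1, -1, -1):
        cur = [0] * (m + 1)
        best = 0  # longest prefix of w[i:] that occurs anywhere in s
        for j in range(m - 1, -1, -1):
            if w[i] == s[j]:
                cur[j] = prev[j + 1] + 1
                if cur[j] > best:
                    best = cur[j]
        # Every prefix of a substring of s is itself a substring of s,
        # so exactly `best` substrings of w starting at i occur in s.
        total += best
        prev = cur
    return total
```

Under this counting convention, `count_common_substrings("abab", "ab")` gives 6: the occurrences "a", "ab", "b", "a", "ab", "b" at start positions 0, 0, 1, 2, 2, 3. The sketch runs in O(|w|·|s|) time and O(|s|) extra space, since only two rows of the table are kept.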
Similar resources
A Template Discovery Algorithm by Substring Amplification
In this paper, we consider to find a set of substrings common to given strings. We define this problem as the template discovery problem which is, given a set of strings generated by some fixed but unknown pattern, to find the constant parts of the pattern. A pattern is a string over constant and variable symbols. It generates strings by replacing variables into constant strings. We assume that...
Linear-Space Substring Range Counting over Polylogarithmic Alphabets
Bille and Gørtz (2011) recently introduced the problem of substring range counting, for which we are asked to store compactly a string S of n characters with integer labels in [0, u], such that later, given an interval [a, b] and a pattern P of length m, we can quickly count the occurrences of P whose first characters’ labels are in [a, b]. They showed how to store S in O(n logn/ log logn) spac...
Forbidden substrings on weighted alphabets
In an influential 1981 paper, Guibas and Odlyzko constructed a generating function for the number of length n strings over a finite alphabet that avoid all members of a given set of forbidden substrings. Here we extend this result to the case in which the strings are weighted. This investigation was inspired by the problem of counting compositions of an integer n that avoid all compositions of ...
Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings
Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated based on the average length of exact common substrings. In this paper, we study the length distribution of k-mismatch common substrings betwee...
GaKCo: A Fast Gapped k-mer String Kernel Using Counting
String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate cooccurrence of mismatched subs...
Journal: CoRR
Volume: abs/1209.4771
Publication date: 2012